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Abstract 

Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for 
human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing 
data, due to such facts as the large number of sequenced variants, the presence of non-negligible fraction of pathogenic 
rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, prevalent applications 
of exome sequencing have been appealing for an effective computational method for identifying causative 
nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called 
SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given 
query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, 
GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein- 
protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the 
statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate 
SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose 
genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. 
In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative 
de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, 
the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua. 
edu.cn/spring. 
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Introduction 

Pinpointing genetic variants underlying human inherited 
diseases is the primary step towards the understanding of the 
pathogenesis of these diseases [1]. With the accelerating advance- 
ment of the next generation sequencing technology, it becomes an 
efficient strategy to selectively sequence coding regions of a 
genome, resulting in the exome sequencing technique [2]. With 
the increase of sequencing throughput and the decrease of 
sequencing costs, exome sequencing has been widely used in not 
only the detection of pathogenic variants for Mendelian diseases 
[3-5] but also the discovery of susceptible loci for complex diseases 
[6-8]. 

A majority of genetic variants captured by exome sequencing 
studies are nonsynonymous single nucleotide variants (SNVs), 
whose occurrences may change structures of encoded proteins, 
thereby affecting functions of proteins and further causing diseases 
[5]. It has been shown that among the large number (typically 
around 8,000-10,000) of nonsynonymous SNVs sequenced in an 



exome, a significant fraction occurs with low minor allele 
frequency (MAF^ 1 %), belonging to the category of rare genetic 
variation [9,10]. Recent studies have also shown that a non- 
negligible fraction of disease-causing SNVs occur de novo, 
representing the most extreme form of rare variants [11-13]. 
The existence of such rare or de novo mutations, together with the 
fact that the number of affected and normal individuals being 
sequenced is typically quite limited, has been obstructing direct 
applications of such traditional statistical genetics methods as 
family-based linkage analysis and population-based association 
studies to the analysis of exome sequencing data [14]. Indeed, 
prevalent applications of exome sequencing have been appealing 
for an effective computational method for the identification of 
pathogenic variants from a large number of sequenced nonsynon- 
ymous SNVs [1,5]. 

To meet the requirement in the analysis of exome sequencing 
data, existing methods for predicting functional implications of 
nonsynonymous SNVs have been borrowed. These methods, with 
examples including SIFT [15], PolyPhen2 [16], LRT [17], 
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Author Summary 

The detection of causative nonsynonymous single nucle- 
otide variants (SNVs) is essential for the understanding of 
the pathogenesis of human inherited diseases. In this 
paper, we propose a statistical method called SPRING {Snv 
PRioritization via the INtegration of Genomic data) to 
combine six functional effect scores calculated by existing 
methods and five association scores derived from multiple 
genomic data sources to estimate the statistical signifi- 
cance that a nonsynonymous SNV is pathogenic for a 
query disease. We find that SPRING is effective in 
identifying disease-causing SNVs for diseases whose 
genetic bases are either partly known or completely 
unknown across a variety of inheritance styles. With real 
exome sequencing data, we show the qualified potential 
of SPRING in not only the detection of causative SNVs in 
simulation studies but also the identification of pathogenic 
de novo mutations for autism, epileptic encephalopathies 
and intellectual disability. 

MutationTaster [18], GERP [19], PhyloP [20], and many others 
[21-25], typically predict damaging effects of a nonsynonymous 
SNV on the function of its hosting protein based on individual or 
combined use of such information as sequence properties [15], 
structure characteristics [16] and database annotations [18]. 
Genome-scale prediction results of these methods have also been 
collected in databases such as the dbNSFP [26]. However, for a 
specific query disease, the alteration of the function of a gene 
hosting a variant does not necessarily mean that the variant is 
pathogenic for the query disease. For example, in the UniProtKB/ 
Swiss-Prot database [27] (release 20 1 2_09), there have been 37 
SNVs reported in gene ABCB4 (corresponding to multidrug 
resistance protein 3, MDR3_HUMAN). Among these variants, 1 1 
(e.g., p.Argl50Lys) has been reported to be causative for 
intrahepatic cholestasis of pregnancy (MIM: 147480) [28,29], 3 
(e.g., p.Gly983Ser) causative for progressive familial intrahepatic 
cholestasis type 3 (MIM: 602347) [30-32], and 6 (e.g., 
p.Ala934Thr) causative for gallbladder disease type 1 (MIM: 
600803) [33,34]. Therefore, in order to access whether a 
nonsynonymous SNV is causative for a query disease, it is not 
enough to only predict the functionally damaging effects of the 
variant — the association information between the disease and the 
gene hosting the variant is also important. 

With this understanding, we propose in this paper a statistical 
method called SPRING (Snv PRioritization via the INtegration of 
Genomic data) for the detection of pathogenic nonsynonymous 
SNVs for a given query disease in exome sequencing studies. 
Given a query disease and a set of candidate nonsynonymous 
SNVs, SPRING calculates a (/-value for each candidate variant to 
indicate the statistical significance that the variant is causative for 
the query disease and thus provides a means of prioritizing the 
candidate variants. SPRING achieves this goal by using a rigorous 
statistical model to integrate six functional effect scores that are 
calculated by SIFT, PolyPhen2, LRT, MutationTaster, GERP 
and PhyloP to indicate the functional implication of a nonsynon- 
ymous SNV and five association scores that are derived from gene 
ontology, protein-protein interactions, protein sequences, protein 
domain annotations and gene pathway annotations to describe the 
potential association between the variant and the query disease. 
The integrated />-values are further converted to (/-values for 
addressing the multiple testing correction problem [35,36], We 
perform a series of comprehensive validation experiments to access 
the effectiveness of SPRING. Results show that our method is 



valid for diseases whose genetic bases are either partly known or 
completely unknown, effective for diseases with a variety of 
inheritance styles, and capable of identifying disease-causing SNVs 
in whole-exome sequencing studies. We further show the 
capability of our method in detecting causative de novo mutations 
for autism, epileptic encephalopathies, and intellectual disability. 

Results 

Principles of the proposed method 

The computational assessment of functional implications of 
nonsynonymous SNVs has been usually formulated as a task of 
predicting functionally damaging effects of such SNVs. To this 
end, existing methods predict the potential impact of a nonsynon- 
ymous SNV on the function of its host protein based on such 
information as sequence properties [15], structure characteristics 
[16], and database annotations [18]. The principle behind these 
methods is that a functionally damaging SNV usually raises a 
significant change on the structure and function of the host 
protein, and the sequence at the mutation position is more 
conserved, while a neutral SNV typically results in a minor or 
negligible change in protein structure and function, and the 
sequence of the resulting protein is less conserved. 

The computational identification of disease genes has been 
typically modeled from the viewpoint of one-class prioritization. 
Given a query disease and a list of candidate genes, existing 
methods rank candidate genes according to their strength of 
association with the query disease. This is usually done according 
to the guilt-by-association principle [37], which assumes that genes 
related to the same disease are correlated in their functions, and 
such correlation can be calculated from such genomic data as gene 
sequences [38], gene functional annotations [39], protein-protein 
interactions [40], etc [41,42]. 

However, for a specific query disease, the alteration of the 
function of a gene hosting an SNV does not necessarily mean the 
association between the gene and the query disease, as we have 
analyzed previously that the SNVs occurring in ABCB4 may cause 
three diseases intrahepatic cholestasis of pregnancy, progressive 
familial intrahepatic cholestasis type 3 and gallbladder disease type 
1 . On the other hand, the association between a gene and a query 
disease does not mean every functional variant in the gene is 
causative for the query disease. For example, SCN2A (MIM: 
182390) has been validated to be causative for autism (MIM: 
209850) [43] and early infantile epileptic encephalopathy- 1 1 
(EIEE11, MIM: 613721) [44,45] but benign familial neonatal- 
infantile seizures-3 (BFIS3, MIM: 607745) [46,47]. Nevertheless, 
based on the dbSNP database, among the 37 SNVs found in 
SCN2A, only 7 are detected to be pathogenic, and the other 30 
are not reported to be causative for any disease up to now [48]. 
Therefore, in order to access whether an SNV is causative for a 
query disease, one needs to integrate both the functional 
implication of the variant and the association information between 
the disease and the gene hosting the variant. Based on this 
reasoning, we model the identification of causative SNVs for a 
query disease from a set of candidate SNVs as a prioritization 
problem and propose a computational approach called SPRING 
to address this problem. 

Specifically, as illustrated in Figure 1 , given a query disease and 
a set of candidate SNVs, SPRING calculates a (/-value for each 
candidate variant to indicate the statistical significance that the 
variant is causative for the query disease and thus provides a 
means of ranking the candidate variants. SPRING achieves this 
goal by using a statistical method called Fisher's combined 
probability test with dependence correction to integrate six 
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functional effect /(-values that characterize the functional implica- 
tion of a variant and five association /j-values that describe the 
potential association between the variant and the query disease. 
The six functional effect /f-values are derived from existing 
approaches for prediction functional implications of SNVs, 
including SIFT [15], PolyPhen2 [16], LRT [17], MutationTaster 
[18], GERP [19], and PhyloP [20]. The five association /-values 
are derived from genomic data sources, including gene ontology, 
protein-protein interactions, protein sequences, domain annota- 
tions, and pathway annotations. To address the multiple testing 
correction problem, /j-values resulting from the Fisher's method 
[49] are further converted to ^-values to control the positive false 
discovery rate (pFDR) [35,36] . 

Data sources 

We extracted a list of 1,436 diseases from the OMIM database 
(accessed in November 2012) [50] and downloaded a total of 1,206 
genes associated with these diseases using the tool BioMart [51] 
(Table SI). 

We downloaded from the UniProtKB/Swiss-Prot database [27] 
(release 2012_09) 23,320 disease-causing SNVs annotated as 
"Disease" and 37,193 neutral SNVs annotated as "Polymor- 
phism." We downloaded functional effect scores for SNVs 
(calculated by SIFT, PolyPhen2, LRT, MutationTaster, GERP 
and PhyloP) from the dbNSFP database [26] (version 2.0b4), 
which included prediction scores for not only missense mutations 
but also nonsense mutations. Matching these two data sources, we 
obtained 12,610 disease-causing SNVs and 23,403 neutral SNVs 
with at least one functional effect score available. We downloaded 
exome sequencing data of eight HapMap individuals [52] that 
represented three populations (Europe, Asia and Africa) and 
derived a set of SNVs for each individual (Text SI). 

We downloaded protein-protein interaction data from the 
STRING database [53] (Version 9.0). Focusing on high confident 
interactions (confidence scores greater than or equal to 0.9), we 
extracted 9,966 proteins and 116,648 interactions between the 
proteins. We downloaded the gene ontology (GO) and associated 



annotations (accessed on November 2, 2012). Focusing on the 
biological process domain, we calculated semantic similarity 
between 14,283 genes. We downloaded sequences of human 
proteins from the UniProt database (release 2012_09) and 
calculated their pairwise similarities using SSEARCH [54]. 
Focusing on e-values less than or equal to 0.001, we obtained 
20,281 proteins and 912,018 similarities between them. We 
downloaded annotations of protein structural domains from the 
Pfam database [55] (version 26.0). Focusing on the manually 
curated part (Pfam- A), we collected 1,066 domains that were 
contained in at least five human proteins and derived pairwise 
similarities between 12,713 proteins that contained at least one of 
these domains. We downloaded annotations of 232 pathways from 
the KEGG database [56] (release 58.0) and derived pairwise 
similarities between 5,951 genes accordingly. 

Performance on diseases with partly known genetic 
bases 

We validated SPRING for diseases whose genetic bases were 
partly known (i.e., some genes had been annotated as associated 
with the diseases). To simulate this situation, we extracted 1 1 3 
diseases annotated as associated with two or more genes from the 
OMIM database (Table SI). Taking each of these diseases as the 
query disease, we collected its causative SNVs from the Swiss-Prot 
database [27] to obtain a set of test SNVs, and we ranked each of 
them against three sets of control SNVs, including (1) a neutral 
control set composed of SNVs annotated as "Polymorphism" in 
the Swiss-Prot database, except for those selected for estimating 
functional effect /(-values, (2) a disease control set consisting of 
SNVs annotated as "Disease" in the Swiss-Prot database, except 
for those in the test set, and (3) a combined control set obtained as 
the union of the neutral and the disease sets. The disease control 
set was used to assess the capability of our method in distinguishing 
variants causative for the query disease from those causative for 
other diseases but irrelevant to the query one. The combined 
control set was used to simulate the real situation in which an 
individual might carry not only neutral variants but also variants 



Query disease 

Autism 

Candidate SNVs 



Functional effect scores Functional effect p-values Ranking list 



Chi 


Pos 


Ref 


Alt 


Gene 


22 


42294709 


C 


G 


SREBF2 


1 


23236941 


C 


T 


EPHB2 


18 


43479407 


C 


T 


EPG5 


17 


556648 


G 


A 


VPS53 


2 


166210819 


G 


T 


SCN2A 


2 


166201379 


C 


A 


SCN2A 


16 


49764733 


A 


C 


ZNF423 


X 


41524649 


C 


G 


CASK 


5 


14394220 


A 


T 


TRIO 



22 42294709 


CG 


SIFT: 


0.0100 


PolyPhen2: 


0.9980 


MutationTaster 


1.0000 


LRT: 


0.0000 


GERP: 


4.3800 


PhyloP: 


1.2580 



Candidate genes 

SREBF2, EPHB2, ... 
Seed genes 

r- 1^ APBA2, CHD8, ... 
Genomic data 

Gene ontology 
Protein interactions 
Protein sequences 
Domain annotations 
Pathway annotations 



22 42294709 C G 
SIFT: 0.1510 
PolyPhen2: 0.0388 
MutationTaster: 0.1107 
LRT: 0.0205 
GERP: 0.2790 
PhyloP: 0.3300 



GO: 
PPI: 

Sequence: 

Domain: 

Pathway: 



0.0009 
0.0923 
0.5205 
0.5341 
NaN 



Association p-values 

22 42294709 C G 



Chr 



Pos 



Ref Alt Gene Rank (/-value 



2 


166210819 


G 


T 


SCN2A 


1 


1.47E-10 


2 


166201379 


C 


A 


SCN2A 


2 


3.40E-10 


1 


23236941 


C 


T 


EPHB2 


3 


7.28E-9 


X 


41524649 


C 


G 


CASK 


4 


8.19E-4 


5 


14394220 


A 


T 


TRIO 


5 


4.00E-3 


22 


42294709 


C 


G 


SREBF2 


6 


3.95E-2 


16 


49764733 


A 


C 


ZNF423 


6 


3.95E-2 


18 


43479407 


C 


T 


EPG5 


8 


1.25E-1 


17 


556648 


G 


A 


VPS53 


9 


1.31E-1 



Figure 1. Workflow of SPRING. Given a query disease and a set of candidate SNVs as inputs, SPRING calculates a q-value for each candidate and 
generates a ranking list of the candidates as the output. A q-value is calculated by using Fisher's method with dependence correction to integrate six 
functional effect p-values and five association p-values. 
doi:10.1371/journal.pgen.1004237.g001 
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responsible for diseases other than the query one. In the 
calculation of functional effect /(-values, we partitioned SNVs 
annotated as "Polymorphism" in the Swiss-Prot database into two 
equal parts at random and used one part to estimate the null 
distribution and the other part as the neutral control set. In the 
calculation of association /(-values, we selected seed genes for the 
query disease as genes annotated as associated with the disease, 
except for the one hosting the test SNV. 

We summarized ranks of the test SNVs in Figure 2 (A— C). 
There are a total of 1,501 disease SNVs annotated as causative for 
the 113 diseases. In the validation against the neutral control set 
(1 1,702 SNVs), SPRING ranks 1,161 test SNVs among top 10 and 
1,306 among top 50. In contrast, with a random guess procedure, 
one could only expect 10x1,501/11,702=1.28 test SNVs 
enriched among top 10 and 6.41 among top 50. In the validation 
against the disease control set (12,605 SNVs), SPRING ranks 185 
test SNVs among top 10 and 653 among top 50, while random 
guess can only enrich 1.19 test SNVs among top 10 and 5.95 
among top 50. In the validation against the combined control set 
(24,307 SNVs), SPRING ranks 164 test SNVs among top 10 and 
628 among top 50, while random guess can only enrich 0.62 test 
SNVs among top 10 and 3.09 among top 50. These results suggest 
the capability of our method in identifying SNVs causative for 
diseases whose genetic bases are partly known. 

We then derived two criteria to quantify the performance of our 
method. Dividing the rank of a test SNV by the total number of 
candidates, we obtained the rank ratio of the SNV. Averaging 
rank ratios of all test SNVs for a query disease, we obtained the 
first criterion called the Mean Rank Ratio (MRR). At a certain 
threshold of the rank ratio, we defined the sensitivity as the 



fraction of test SNVs ranked above the threshold, and specificity as 
the fraction of control SNVs ranked below the threshold. Varying 
the threshold, we plotted the rank operating characteristic (ROC) 
curve (sensitivity versus 1 -specificity) and further calculated the 
area under this curve as the second criterion called the AUC score. 
As shown in Figure 3 (A and B), the average MRR and AUC for 
the 113 diseases are 0.0071 and 0.9930 respectively in the 
validation against the neutral control set, 0.0466 and 0.9535 
respectively in the validation against the disease control set, and 
0.0275 and 0.9725 respectively in the validation against the 
combined control set. These results further suggest the effective- 
ness of our method, considering that random guess can only yield 
an MRR of 0.5 and an AUC of 0.5. 

We analyzed the statistical significance of candidate SNVs and 
found that (/-values of test SNVs were much smaller than those of 
control ones. Hence, when ranking a test SNV against control 
ones according to their ^-values, the test SNV was likely to be 
ranked among top positions (Text SI). We further assessed 
whether the number of seed genes affected the performance of our 
method and found this factor having little influence (Text SI). 

Performance on diseases of unknown genetic bases 

We validated SPRING for diseases whose genetic bases were 
unknown (i.e., no genes had been annotated as associated with the 
diseases). To simulate this situation, we extracted a total of 1,436 
diseases annotated as associated with at least one gene from the 
OMIM database (Table SI). Taking each of these diseases as the 
query disease and pretending that the genetic basis of this disease 
was unknown, we collected annotated causative SNVs of the 
disease to obtain test SNVs, and we ranked each of them against 
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Figure 2. Rank distributions of the test SNVs. (A-C) Results for diseases with partly known genetic bases when validating against the neutral, 
disease, and combined control sets, respectively. (D-F) Results for diseases of unknown genetic bases when validating against the neutral, disease, 
and combined control sets, respectively. 
doi:1 0.1 371 /journal.pgen.1 004237.g002 
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Figure 3. Performance of SPRING in the validation experiments. (A) and (B) MRRs and AUCs for diseases with partly known genetic bases, 
respectively. (C) and (D) MRRs and AUCs for diseases of unknown genetic bases, respectively. 
doi:1 0.1 371 /journal.pgen.1 004237.g003 



the three control sets described previously. Different from the 
validation for diseases with partly known genetic bases, we 
identified 10 diseases that had the highest phenotype similarities 
to the query disease according to pre-calculated pairwise 
phenotype similarity scores between 5,080 diseases [57] and used 
genes known as associated with these disease as seed genes for 
calculating association ^-values. 

We summarized ranks of the test SNVs in Figure 2 (D— F). There 
are a total of 12,610 disease SNVs annotated as causative for the 
1,436 diseases. In the validation against the neutral control set 
(1 1,702 SNVs), SPRING ranks 5,703 test SNVs among top 10 and 
7,635 among top 50, while random guess can only rank 10.78 test 
SNVs among top 10 and 53.88 among top 50. In the validation 
against the disease control set (12,605 SNVs), SPRING ranks 454 
test SNVs among top 10 and 1,785 among top 50, while random 
guess can only enrich 10.00 test SNVs among top 10 and 50.02 
among top 50. In the validation against the combined control set 
(24,307 SNVs), SPRING ranks 435 test SNVs among top 10 and 
1,748 among top 50, while random guess can only enrich 5.19 test 
SNVs among top 10 and 25.94 among top 50. These results 
suggest the capability of our method in identifying SNVs causative 
for diseases whose genetic bases are unknown. 

We then summarized MRRs and AUCs for individual diseases 
in Figure 3 (C and D). The average MRR and AUC for all 1,436 
diseases are 0.0440 and 0.9560 respectively in the validation 
against the neutral control set, 0.2037 and 0.7964 respectively in 
the validation against the disease control set, and 0.1265 and 
0.8735 respectively in the validation against the combined control 



set. We analyzed the statistical significance of candidate SNVs and 
found that the (/-values of test SNVs were also much smaller than 
those of control ones (Text SI). We further assessed the influence 
of the number of neighboring diseases and found that our method 
was robust to this parameter (Text SI). 

These results demonstrate the effectiveness of SPRING in 
dealing with diseases whose genetic bases have not been 
deciphered yet and thus have no information about associated 
genes. On one hand, since phenotypically similar diseases may 
have genetic overlap [58], it could be a feasible way to borrow 
genes known as associated with diseases of high phenotype 
similarities to a query disease to facilitate the inference of SNVs 
causative for the query disease. On the other hand, we also notice 
the drop in performance when compared with the results in the 
previous section. We suppose the reason is that seed genes selected 
here may not be as reliable as those for diseases with partly known 
genetic bases. 

Prediction power for diseases of different genetic styles 

We assessed the prediction power of SPRING for diseases with 
different genetic styles. We first classified the 1,436 diseases into a 
group of 1,378 Mendelian diseases and a group of 58 complex 
diseases according to the Genetic Association database (released in 
November, 2012) [59] (Table SI). Results show that SPRING can 
recover disease-causing SNVs for both groups (Figure 4). For 
example, in the validation against the combined control set, the 
average MRR and AUC for Mendelian diseases are 0.1272 and 
0.8729 respectively, and those for complex diseases are 0.1 1 15 and 
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Figure 4. Performance of SPRING for diseases of different inheritance styles. (A) MRRs when validating against the neutral, disease, and 
combined control sets, respectively. (B) AUCs when validating against the neutral, disease, and combined control sets, respectively. 
doi:10.1371/journal.pgen.1004237.g004 



0.8886 respectively. The two-sided Wilcoxon rank sum test 
suggests that the performance of our method on these two 
categories of disease is not significantly different {p-value = 0.51 18). 

We then classified the 1,378 Mendelian diseases into three 
categories according to their inheritance patterns, obtaining 396 
autosomal dominant diseases (MIM: lxxxxx), 468 autosomal 
recessive diseases (MIM: 2xxxxx), and 109 X-linked diseases 
(MIM: 3xxxxx). The rest of 405 diseases (MIM: 6xxxxx) are not 
included in the comparison. Results show that SPRING can recover 
disease-causing SNVs for all these three classes of diseases (Figure 4). 
For example, in the validation against the combined control set, the 
average MRR and AUC are 0.09617 and 0.9039 respectively for 
autosomal dominant diseases, 0.1457 and 0.8543 respectively for 
autosomal recessive diseases, and 0. 1545 and 0.8455 respectively for 
X-linked diseases. Pairwise two-sided Wilcoxon rank sum tests 
suggest that the performance on autosomal dominant diseases is 
significantly different from that on both autosomal recessive diseases 
(p-value = 3.897 x 10 -9 ) and X-linked diseases (p-vaiue — 
1.643x10 3 ), while the performance on the latter two classes is 
not significantly different (p-value = 0.7942). 

We further identified 48 immune diseases and 263 neurological 
disorders and found that our method was also capable of recovering 
disease-causing SNVs for both classes of diseases (Figure 4). For 
example, in the validation against the combined control set, the 
average MRR and AUC are 0.1822 and 0.8178 respectively for 
immune diseases, and 0.1112 and 0.8889 respectively for neuro- 
logical disorders. The two-sided Wilcoxon rank sum test suggests 
that the performance of our method on these two classes of diseases 
is of marginal difference (p-value = 0.0159). 

Prediction power for rare SNVs 

One of the superiorities of exome sequencing is the capability of 
finding disease-causing rare SNVs for a query disease. To 
demonstrate the effectiveness of SPRING in identifying causative 
rare SNVs, we collected 932 causative rare SNVs with minor allele 
frequency (MAF) less than 0.01 from the dbSNP database [48] and 
identified a total of 444 diseases annotated as caused by these 
variants. For each of these rare SNVs, we pretended that the 
genetic basis of the corresponding disease was unknown, and we 
ranked the SNV against the three control sets described 



previously. Results show that the MRR and AUC are 0.0340 
and 0.9660 respectively in the validation against neutral controls, 
0.1676 and 0.8324 respectively in the validation against disease 
controls, and 0.1029 and 0.8971 respectively in the validation 
against combined controls. All these results support the effective- 
ness of SPRING in the identification of disease-causing rare SNVs. 
We then compared distributions of functional effect scores for 
these rare SNVs with those for the same number of SNVs selected 
at random from a HapMap individual. Results show that the rare 
SNVs are typically assigned more extreme functional effect scores 
than SNVs occurring in a normal individual (Text SI). Since the 
sequence conservation property has been used in the derivation of 
the functional effect scores, we conjecture that the effectiveness of 
our method in this validation experiment can be partly attributed 
to the rarity of such rare mutations in a random human. 

Prediction power of individual data sources 

We assessed prediction power of individual data sources by 
repeating the validation experiment for diseases of unknown genetic 
bases using a single data source. Results, as summarized in Table 1, 
show that functional effect data sources are effective in the 
discrimination of disease-causing SNVs against neutral controls 
but are ineffective in distinguishing such SNVs from disease 
controls, and thus these data sources show low effectiveness in 
distinguishing causative SNVs from combined controls. For 
example, the MRR and AUC for SIFT are 0.1792 and 0.8205 
respectively in the validation against neutral controls, 0.5015 and 
0.4984 respectively in the validation against disease controls, and 
0.3816 and 0.6183 respectively in the validation against combined 
controls. ROC curves (Figure 5, dotted lines) also support this 
observation. It is not surprising to see the effectiveness of these data 
sources in the validation against neutral controls, since the power of 
the mechanism used for calculating functional effect scores has been 
verified in numerous studies [15-25], and our Rvalue transforma- 
tion strategy does not affect the comparison of such scores. The 
ineffectiveness of these data sources in the validation against disease 
controls can be attributed to the absence of disease-specific features 
to identify which exactly disease a variant is associated with. As a 
result, functional effect scores lack the power of discriminating 
between variants causing different diseases. 
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Table 1. Performance of individual data sources. 













Neutral 


Disease 


Combined 




Data source 


Neutral MRR {%) 


Disease MRR (%) 


Combined MRR (%) 


AUC (%) 


AUC (%) 


AUC (%) 


Coverage (%) 


SIFT 


17.92 


50.15 


38.16 


82.05 


49.84 


61.83 


96.39 


PolyPhen2 


13.56 


49.76 


36.29 


86.44 


50.23 


63.70 


100.00 


LRT 


24.04 


49.90 


40.28 


75.94 


50.08 


59.71 


91.62 


MutationTaster 


20.60 


49.44 


38.71 


79.40 


50.55 


61.28 


91.20 


GERP 


23.73 


48.92 


39.55 


76.26 


51.07 


60.44 


99.92 


PhyloP 


23.51 


48.18 


39.00 


76.49 


51.81 


60.99 


100.00 


GO 


14.56 


22.58 


17.69 


85.44 


77.42 


81.30 


98.68 


PPI 


14.62 


24.78 


19.87 


85.39 


75.22 


80.13 


88.72 


Sequence 


31.69 


32.91 


32.32 


68.31 


67.09 


67.68 


99.72 


Domains 


38.40 


38.04 


38.21 


61.60 


61.96 


61.79 


76.81 


Pathways 


30.73 


41.77 


36.43 


69.27 


58.23 


63.57 


60.24 


All 


4.40 


20.37 


12.65 


95.60 


79.64 


87.35 


100.00 



The coverage of a data source is defined as the proportion of SNVs having the score calculated from the data source. Note that in the calculation of the criteria, we 
pooled validation results for all query diseases instead of considering individual diseases separately. 
doi:1 0.1 371 /journal.pgen.1 004237.t001 



Table 1 also shows that association data sources exhibit medium 
effectiveness in distinguishing disease-causing SNVs from neutral, 
disease, and combined controls. For example, the MRR and AUC 
for GO are 0.1456 and 0.8544 respectively in the validation 
against neutral controls, 0.2258 and 0.7742 respectively in the 
validation against disease controls, and 0.1769 and 0.8130 
respectively in the validation against combined controls. This 
observation is also supported by ROC curves (Figure 5, solid lines) 
and can be explained as follows. The association score assigned to 
a variant is calculated based on the gene hosting the variant. 
Therefore, SNVs, regardless of their functional implications, will 
be assigned identical scores as long as they occur in the same gene. 
This reasoning, together with the fact that distributions of 
association scores for neutral and disease controls are not 
significandy different (Text SI), results in the above observation. 

When compared with a single data source, the integration of all 
1 1 data sources demonstrates much lower MRRs (0.0440, 0.2037 



and 0. 1265 in the validation against neutral, disease and combined 
controls, respectively) and much higher AUCs (0.9560, 0.7964 and 
0.8735 in the validation against neutral, disease and combined 
controls, respectively), suggesting the effectiveness of data integra- 
tion. This conclusion is also supported by Figure 5 in that the ROC 
curves for integrating all data sources (black solid lines) climb much 
faster than those for individual data sources towards the top left 
corner of the plot. Besides, the coverage of our method also benefits 
from data integration. For example, in the validation experiments, 
only 88.72% SNVs have PPI information, and the coverage for 
pathway data is even as low as 60.24%. With the integration of 
multiple data sources, however, the causative effect of a variant for a 
query disease can be predicted as long as the variant appears in a 
data source, and thus the coverage of our method is extended to the 
union of variants included in individual data sources. 

Considering that the data sources are correlated, the prediction 
power of an individual data source may not reflect its real 



A 1 



B 1 



C 1 




0.5 
1 -Specificity 



0.5 
1 -Specificity 



0.5 

1 -Specificity 



SIFT PolyPhen2 LRT MutationTaster GERP PhyloP 

GO PPI Sequence Domain Pathway All data 



Figure 5. ROC curves of individual data sources. (A) Results when validating against the neutral control set. (B) Results when validating against 
the disease control set. (C) Results when validating against the combined control set. 
doi:1 0.1 371 /journal.pgen.1 004237.g005 
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contribution to the final performance of our method. We therefore 
evaluated relative contribution of a data source by erasing scores 
derived from the data source and repeating the validation 
experiment for diseases of unknown genetic bases against the 
combined control set. We calculated the proportion of test SNVs 
whose rank ratio increased after the removal of a data source to 
measure the relative contribution of the data source. Obviously, 
the larger the value of this criterion, the higher the relative 
contribution of a data source. As shown in Table 2, relative 
contributions of different data sources are quite different. For 
functional effect data sources, SIFT has the highest contribution, 
followed by LRT. For association data sources, GO and PPI both 
have high contributions, followed by pathway information. The 
results also suggest that diseases in different categories prefer 
different data sources. For example, GO, PPI, pathway and SIFT 
show much higher contributions for complex diseases than for 
Mendelian diseases, while PPI has much lower contribution for 
immune diseases than for neurological disorders. 

We further adopted a sequential backward selection (SBS) 
strategy to select a subset of data sources for each of the 1,436 
diseases. Results (Text SI and Table S2) show that the selection 
procedure can improve the performance of SPRING. However, 
different diseases show different preferences on the selected data 
sources, and such preferences are diverse. We therefore suggest 
either seeking for the simplicity to use all data sources without 
selection or resorting to a cross-validation experiment to select a 
subset of data sources for a query disease when there is no strong 
prior knowledge indicating the preference of the disease to the 
data sources. In the reset of this paper, we use all data sources 
without selection by default. 

Estimation of the false positive rate and true positive rate 

The ^-values calculated by SPRING and can be used in two 
ways. First, for a set of candidate SNVs, their ^-values can be used 
as bases for prioritizing the SNVs. Second, for a single SNV, its q- 
value can be used to predict whether the SNV is causative for a 
query disease. We therefore assessed whether the false positive rate 
(FPR) and true positive rate (TPR) can be controlled at a desired 
level at a given j-value threshold. 

For a HapMap individual, we calculated j-values for SNVs 
reported in the individual for each of the 1,436 diseases and derived 
the FPR for a disease as the proportion of SNVs whose ^-values are 

Table 2. Relative contributions of individual data sources. 



less than or equal to a threshold. Results show that the TFP can be 
well controlled (Figure 6). For example, at the Rvalue thresholds 
0.1, 0.05, 0.01 and 0.005, the average FPRs for all diseases are 
7.16%, 4.94%, 2.01% and 1.34%, respectively. We notice these 
numbers are greater than those obtained using the neutral control 
set (average FPRs at the above j-value thresholds are 4. 18%, 2.43%, 
0.71% and 0.48%, respectively), and we suppose the reason behind 
this phenomenon is that some variants occurring in these HapMap 
subjects may actually be related to some diseases [60] . We further 
performed a simulation study by embedding different proportions of 
disease SNVs into the neutral control set and found that both FPR 
and TPR could also be well controlled (Text SI). 

Simulation studies for exome sequencing data 

We assessed the effectiveness of SPRING in identifying disease- 
causing SNVs in real exome sequencing data. For this purpose, we 
generated a large number of synthesized exomes by inserting each 
SNV causing one of the 1,436 diseases into the exome of a 
Hapmap individual, and we applied SPRING to rank the 
embedded SNV against the other SNVs in each synthesized 
exome. Results, as summarized in Figure 7, demonstrate the 
effectiveness of our method in distinguishing disease-causing SNVs 
from those occurring in normal individuals. For example, 
According to Figure 7 (A), for the eight individuals, 45.18%— 
49.19% causative SNVs are ranked among top 10, and 70.89%- 
74.15% are ranked among top 50. According to Figure 7 (B and 
C), the average MRRs for the eight individuals range from 0.0225 
to 0.0237 (the median MRRs range from 0.0025 to 0.0029), and 
the average AUCs range from 0.9764 to 0.9775 (the median 
AUCs range from 0.9972 to 0.9978). These results suggest that our 
method is effective in finding true disease-causing SNVs in exome 
sequencing studies. 

Detection of causative nonsynonymous de novo 
mutations for autism, epileptic encephalopathies and 
intellectual disability 

De novo mutations are genetic variants that are not inherited 
from parents. As the most extreme form of rare genetic variation, 
de novo mutations have been subjected to less stringent evolutionary 
selection pressure and thus are usually more functionally 
damaging than inherited genetic variants [11]. Facilitated by the 



Data source 


All diseases (%) 


Mendelian diseases (%) 


Complex diseases (%) 


Immune diseases (%) 


Neurological disorders (%) 


SIFT 


53.56 


52.71 


60.86 


53.61 


52.49 


PolyPhen2 


35.07 


34.94 


36.18 


35.26 


33.47 


LRT 


49.75 


49.15 


54.81 


53.20 


55.28 


MutationTaster 


35.42 


34.96 


39.36 


38.14 


37.11 


GERP 


26.73 


26.73 


26.80 


23.30 


26.67 


PhyloP 


24.73 


25.41 


18.93 


27.42 


24.20 


GO 


74.46 


73.00 


86.90 


75.88 


77.18 


PPI 


74.22 


73.34 


81.76 


45.77 


80.54 


Sequence 


31.05 


32.40 


19.45 


22.06 


27.56 


Domains 


37.20 


37.31 


36.26 


32.16 


38.67 


Pathways 


61.86 


59.36 


83.27 


68.45 


64.42 



The contribution of a data source is defined as the proportion of test SNVs whose rank ratios increase after the removal of the data source. Validation experiments are 
conducted against the combined control set. 
doi:1 0.1 371 /journal.pgen.1 004237.t002 
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Figure 6. Estimated false positive rates under different g-value cut-offs when using neutral SNVs in the Swiss-Prot database and 
exomes of the eight HapMap individuals as negative control sets. 

doi:10.1371/journal.pgen.1004237.g006 



whole-exome sequencing technique, recent studies have shown 
that individual de novo germline mutations occurring in single genes 
could be the major cause of rare Mendalian diseases such as 
Schinzel-Giedion syndrome [61], Kabuki syndrome [62] and 
Bohring-Opitz syndrome [63] . Moreover, recent studies have also 
suggested that the collection of de novo mutations affecting different 
genes in different individuals might explain a proportion of 
common complex diseases such as schizophrenia [13,64], autism 
[65-69], epileptic encephalopathies [70], and intellectual disability 
[7,71,72]. 

To demonstrate the power of our method in diagnosing disease- 
causing de novo mutations, we applied SPRING to a whole-exome 
sequencing data set of autism (PMID 22495306 [65]). From the 
literature [65], we collected 135 and 87 unique candidate 
nonsynonymous de novo mutations from the exome sequencing 
data of probands and siblings respectively, and 214 of these 



mutations can be mapped to the dbNSFP database. With the 
criterion that a seed gene should have been reported as associated 
with autism (MIM: 209850) by independent studies before the 
publication of this data set, we selected from the OMIM database 
a total of 34 seed genes and then applied SPRING to prioritize the 
candidate mutations (Table S3). 

In the literature [65] , a gene SCN2A was reported as associated 
with autism, and two probands each carried a nonsense de novo 
mutation in this gene. SPRING assigned very significant (/-values 
(1.47x10 10 and 3.40x10 10 ) to these two mutations and ranked 
them first and second in a list of 214 candidates. Considering that 
the probability of ranking these two mutations at the top by a 

Y214 N 



random guess procedure is only 1 



V 2 



±4.39 x 10-\ the 



capability of our method in identifying causative de novo mutations 
in this application is strongly supported. 
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Figure 7. Validation results for synthesized exomes. (A) Distributions of ranks for test SNVs. (B) Boxplots of MRRs. (C) Boxplots of AUCs. 
doi:10.1371/journal.pgen.1004237.g007 



We noticed that SCN2A had been previously annotated as 
causative for autism by an independent study [43] and thus was 
included in seed genes in the above analysis. Although this 
situation is consistent with the application of SPRING in finding 
novel causative variants occurring in known causative genes, 
another important application would require the identification of 
causative variants occurring in genes not yet studied. To simulate 
this scenario, we removed SCN2A from the seed genes and 
prioritized the candidate mutations again. We found that the two 
mutations were now ranked third and sixth. Considering that the 
probability of ranking these two mutations among top 6 by chance 

is only f^^J / ( ^g^l «6.58x 10~ 4 , the capability of our 

method in detecting novel causative de novo mutations in this 
application is also supported. 

We further extended applications of SPRING to three other 
whole-exome sequencing data sets of autism [67-69], one data set 
of epileptic encephalopathies [70] and two data sets of intellectual 
disability [71,72]. Detailed analyses were given in Text SI and 
Table S3. 

Large scale prediction of causative nonsynonymous SNVs 
for 5,080 diseases 

We further performed a large scale prediction of causative SNVs 
for a total of 5,080 diseases in the phenotype similarity matrix [57] . 
Focusing on SNVs collected in both the dbSNP and dbNSFP 
databases, we extracted a total of 174,394 SNVs that occurred in 
genes described by at least one of the five genomic data sources. We 
then used SPRING to prioritize all these candidate SNVs and 
distinguished suspicious causative ones for each disease. For a query 
disease whose genetic basis was partly known, we selected genes 
annotated as associated with the disease as seed genes. For a query 
disease whose genetic basis was completely unknown, we selected 
genes annotated as associated with 10 diseases of the highest 
phenotype similarities with the query disease as seed genes. 
Prediction results for the 5,080 diseases, together with an online 
service and the standalone software of SPRING, are available at 
http:/ /bioinfo.au.tsinghua.edu.cn/ spring. 



Discussion 

In this paper, we formulate the identification of causative 
nonsynonymous SNVs for a query disease as a prioritization 
problem, and we propose a method called SPRING to solve this 
problem by integrating multiple genomic data sources. We 
demonstrate the superior performance of our method by 
conducting a series of validation experiments, showing that our 
method is valid for diseases whose genetic bases are either pardy 
known or completely unknown, effective for diseases with a variety 
of inheritance styles, and capable of identifying disease-causing de 
novo mutations in whole-exome sequencing studies. 

The success of our method can be attributed to a combination 
of several aspects. First, we take the advantage of both functional 
implications of candidate SNVs and potential associations between 
a query disease and genes hosting the SNVs. In contrast, existing 
methods for predicting functional effects of SNVs or prioritizing 
candidate genes utilize only part of such information and can 
hardly achieve the goal of identifying causative variants. Second, 
we ground the inference of causative variants on a rigorous 
statistical model for integrating multiple data sources, not only 
explicidy considering correlations between the data sources, but 
also carefully controlling statistical significance of prediction 
results. As a result, our method does not rely on a single data 
source to make inference and is capable of maintaining a relatively 
low false positive rate at a reasonably high true positive rate. 

Certainly, our method can be further improved in the following 
directions. First, we resort to the phenotype similarity profile to 
collect seed genes for diseases of unknown genetic bases, and 
hence the quality and coverage of such a profile directly determine 
the performance and scope of applications of our method. 
Following the direction of data integration, it might be helpful 
to incorporate multiple phenotype similarity profiles such as those 
calculated based on the human phenotype ontology [73] into our 
method. Along this direction, how to utilize these phenotype 
similarity profiles in a more sophisticated way will be a question 
worth exploring. 

Second, our method currendy uses five genomic data sources to 
infer the association relationship between a query disease and a 
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gene hosting a variant. With the development of high-throughput 
experimental techniques, more and more genome-wide functional 
genomics data such as transcriptional regulation, microRNA 
regulation, DNA methylation and histone modification will be 
available. How to include these data sources in our method will be 
an important aspect. 

Finally, we only focus on SNVs in protein coding regions in this 
paper. However, genetic variants occurring in splicing sites, 
promoter regions, introns, and other parts of a genome are also 
important in the pathogenesis of a disease. How to extend our 
method to enable the identification of disease-causing variants in 
these genomic regions will be an important direction. 

Methods 

Calculation of functional effect p-values 

We derive six /(-values to assess the statistical significance that a 
candidate SNV is causative for a query disease from the viewpoint 
of whether the SNV has damaging effect on the function of the 
gene containing the SNV. For this purpose, we first extract from 
the Swiss-Prot database [27] SNVs annotated as "Polymorphism" 
to obtain a set of neutral SNVs that show no damaging effect on 
functions of their host genes. Then, for each of the six 
bioinformatics tools for predicting functional implications of SNVs 
(SIFT [15], PolyPhen2 [16], LRT [17], MutationTaster [18], 
GERP [19], and PhyloP [20]), we collect predictive scores for the 
neutral SNVs from the dbNSFP database [26] and estimate the 
empirical distribution of the scores. Finally, given a candidate 
SNV, we extract its predictive score for each of the six tools and 
calculate a /(-value as the probability that a neutral SNV is 
assigned at least the same extreme score as that of the candidate 
SNV. 

Specifically, SIFT relies on multiple sequence alignments 
generated using PSI-BLAST [74] to calculate predictive scores 
for SNVs. The smaller a SIFT score, the stronger the evidence 
that an SNV is functionally damaging. Therefore, for SIFT, we 
calculate the probability that a neutral SNV is assigned a 
predictive score smaller than or equal to that of the candidate 
SNV as the functional effect p- value for the candidate. PolyPhen2 
integrates sequence, structure and annotation information to build 
a classification model that predicts the probability that an SNV is 
functionally damaging. The greater a PolyPhen2 probability, the 
stronger the evidence that an SNV is functionally damaging. 
Therefore, we calculate the probability that a neutral SNV is 
assigned a predictive probability larger than or equal to that of the 
candidate SNV as its functional effect /(-value. Methods for 
calculating /(-values for LRT is similar to that for SIFT, and the 
other three tools (MutationTaster, GERP, and PhyloP) are similar 
to that for PolyPhen2. 

Calculation of association p-values 

We derive five /(-values to assess the statistical significance that a 
candidate SNV is causative for a query disease from the viewpoint 
of whether the gene containing the variant is associated with the 
query disease. For this purpose, we first rely on phenotype 
similarities between diseases [57] and known associations between 
diseases and genes to obtain a set of seed genes for the query 
disease. Meanwhile, we calculate five functional similarity scores 
between every pair of genes based on a variety of genomic data 
sources (gene ontology, protein-protein interactions, protein 
sequences, domain annotations, and pathway annotations). Then, 
for each functional similarity, we resort to the guilt-by-association 
principle [23] to calculate a score, indicating the strength of 
association between the gene hosting the candidate SNV (i.e., 



candidate gene) and the query disease. Finally, we convert the 
association score to a /(-value by estimating the probability that a 
non-disease related gene is assigned at least the same extreme 
score as that of the candidate gene. 

In detail, given a candidate SNV, we obtain the candidate gene 
for the SNV by mapping the SNV back onto the genome and 
identifying the gene hosting the SNV. Given a query disease, we 
obtain seed genes for the disease by two means. First, if the genetic 
basis of the query disease has been studied and hence there exist 
some genes annotated as associated with the disease, we use these 
genes related to the disease as seed genes. Second, if the genetic 
basis of the query disease is unknown and thus these is no gene 
having been annotated as related to the disease, we sort all diseases 
except for the query one according to their similarity scores to the 
query disease in non-increasing order and select genes known as 
associated with top 10 diseases in the ranking list as seed genes. 
Obviously, the later strategy can also be used to complement the 
former when the number of genes annotated as a query disease is 
limited. 

The five pairwise functional similarity scores for genes are 
calculated as follows. For the network similarity score, we extract 
9,966 proteins and 116,648 high confidence interactions (with 
confidence scores greater than or equal to 0.9) between the 
proteins from the STRING database [53] (version 9.0) to obtain a 
protein-protein interaction network. Then, we calculate the 
diffusion kernel of this network as K = exp( — y(D — A)), where y 
is a free parameter controlling the magnitude of diffusion, D a 
diagonal matrix containing node degrees, and A the adjacency 
matrix of the network. In our study, we follow the literature [75] to 
choose y = 0.01. Finally, we define the network similarity between 
two genes i and j as the corresponding element k tJ in the diffusion 
kernel. 

For the semantic similarity score, we focus on 18,850 terms in 
the biological process domain of the gene ontology (GO, released 
on November 2, 2012) [76] and 186,080 annotations regarding 
14,283 genes to calculate the semantic similarity between every 
pair of genes using the method of Resnik [77], as detailed in one of 
our previous studies [39]. 

For the sequence similarity score, we extract 20,281 protein 
sequences from the UniProt database (release 2012_09), use the 
Smith- Waterman algorithm [78] implemented in SSEARCH [54] 
to perform a local sequence alignment for every pair of protein 
sequences (with default parameters and the e-value cut-off of to 
0.001), and obtain the similarity scores by dividing the negative 
logarithmic transformed e-values by the maximum of all 
transformed e-values. 

For the domain similarity score, we extract from the Pfam 
database (version 26.0) [55] 13,672 protein families. Focusing on 
domains with at least 5 human proteins annotated, we obtain 
1,066 domains. Then, we denote a human protein as a 1,066 
dimensional binary vector, with a dimension representing the 
presence or absence of a domain in the protein. Finally, we 
calculate the similarity measure of two proteins as the cosine of the 
angle of the corresponding vectors. 

For the pathway similarity score, we extract from the KEGG 
database (release 58.0) [56] 16,662 annotations of 5,951 proteins 
and 232 pathways. Then, we denote a protein as a 232 
dimensional binary vector, with a dimension representing the 
presence or absence of the protein in a pathway. Finally, we 
calculate the similarity measure of two proteins as the cosine of the 
angle of the corresponding vectors. 

Given a candidate gene, a set of seed genes and a type of gene 
functional similarity, we calculate the association score for the 
candidate gene according to the guilt-by-association principle by 
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summing over similarities between the candidate gene and the 
seed genes. Furthermore, we convert the association score to a p- 
value by estimating the probability that a non-disease related gene 
is assigned at least the same extreme score as that of the candidate 
gene. Performing these steps for each of the five gene similarities, 
we obtain five /-values to indicate degrees that the candidate 
gene containing the candidate SNV is associated with the query 
disease. 



with py being the sample correlation coefficient between the test 
statistics [49]. It has been shown that the maximum difference 
between py and the unbiased estimator is less than 0.001 when 
«>36, and the maximum error between Cov(K,-,P}) and its 
approximation given above is no more than 0.00019 [49]. 
Matching the mean and variance of the population and the 
sample, r and v can be estimated as 



Integration of multiple p-values by Fisher's method with 
dependence correction 

We adopt Fisher's combined probability test [79] to combine 
the /(-values derived above and obtain a single /(-value that 
indicates the statistical significance that a candidate SNV is 
causative for the query disease. Given k /(-values to be combined, 
denoted as p\ . . .pk, we calculate a Fisher's combination test 
statistic Tas 

k 

T= -2^2 log/;,, 

i=i 

It is evident that Thas an asymptotic chi-squared distribution with 
2k degrees of freedom when all null hypotheses are true. The final 
combined /rvalue can then be calculated accordingly. 

However, the above Fisher's method assumes independence of 
all /-values to be combined, which is obviously not true in our 
problem. Therefore, we further apply a dependence correction 
strategy [49] to adjust the combined /(-value. Briefly, Yang et al. 
assumes that the null distribution of the Fisher's combination test 
statistic T follows a scaled chi-squared distribution with v degrees 
of freedom (rjtj) when the /(-values to be combined are not 
independent and suggests to estimate the parameters r and v using 
the method of moments [49]. In detail, with definitions 
V(= — 2 log/?,- and T= Xw=i the population mean and the 
variance of the T statistic are derived as 

fi T = rv and a 1 T = 2r 2 v, 

respectively, and the corresponding sample mean and variance are 
calculated as 

fi T = 2k and a 2 T = 4k + 2 Cov ( v u Vj), 

<<j 

respectively. The covariance Co\{Vj,Vj) can be calculated 
approximately as, 



Cov( V u Vj) « cnpy + a 2 p ?. + a 3 pl - (a 4 /n)(l - p|) 2 , 



with ai =3.263119, a 2 = 0.709866, a 3 =0.026589, a 4 = 0.709866, 
Py an unbiased estimator of the correlation coefficient between the 
two test statistics used to derive /(-values /?,- and pj, and n the 
sample size in the calculation of py. Furthermore, f>y can be 
calculated approximately as 



r = 1 + — Cov ( V i> v .j) and * = 2k /' r - 

The adjusted /(-value can then be calculated according to the 
scaled chi-squared distribution (r/?) [49]. In our studies, we use 
a set of SNVs annotated as "Polymorphism" in the Swiss- 
Prot database to estimate the sample correlation coefficients 

(ft,)- 

It is possible that some data sources are absent for a candidate 
SNV. To deal with this missing data problem, we ignore the 
missing data source in the calculation of the Fisher's test statistic 
and the adjusted /(-value. The total number of /(-values to be 
combined will then decrease accordingly. 

We further perform multiple testing corrections on the adjusted 
/(-values by calculating (/-values for candidate SNVs. Briefly, 
Storey et al. [35,36] proposed to control the positive false 
discovery rate (pFDR, the expected proportion of false positives 
among all significant hypotheses, given at least one hypothesis 
having been rejected) in a multiple testing problem and put 
forward a method to calculate (/-values from /(-values. Numerical 
studies have shown the significant improvement of the test 
power by controlling pFDR using the this method [35,36] 
instead of controlling the false discovery rate (FDR) using the 
traditional step-up procedure of Benjamini-Hochberg [80]. 
Therefore, in our study, we adopt ^-values to measure the 
statistical significance that an SNV is causative for a query 
disease. 
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